New York Police Department cares about citizens of New York and NYPD hired Data Scientist, who will provide answer for following question:
How can we improve New York City in order to reduce vehicle collsions and improve safety on roads?
Answer can be provided with comprehensive data analysis which enables to understand main factors that cause vehicle crashes.
Aim of this project is to investigate main contributions that have an affect on the safety on roads by providing answers for following questions:
1. When collisions happen the most frequently? Is it correlated with specific months, weekdays or hours?
2. Where collisions appear the most frequently? Which spots are the most dangerous ones?
3. What are the main factors that contribute collisions in NYC?
dataNYC <- readRDS("data_NYC.rds")
glimpse(dataNYC)
## Observations: 1,185,165
## Variables: 30
## $ borough <chr> "QUEENS", NA, "MANHATTAN", "QUEE...
## $ contributing_factor_vehicle_1 <chr> "Fatigued/Drowsy", "Driver Inatt...
## $ contributing_factor_vehicle_2 <chr> "Unspecified", "Unspecified", "U...
## $ date <dttm> 2012-10-21, 2012-10-21, 2012-10...
## $ latitude <chr> "40.6869122", "40.8493533", "40....
## $ location.type <chr> "Point", "Point", "Point", "Poin...
## $ location.coordinates <list> [<-73.79437, 40.68691>, <-73.87...
## $ longitude <chr> "-73.7943714", "-73.8711899", "-...
## $ number_of_cyclist_injured <chr> "0", "0", "0", "0", "0", "1", "0...
## $ number_of_cyclist_killed <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_motorist_injured <chr> "2", "3", "0", "0", "0", "0", "0...
## $ number_of_motorist_killed <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_pedestrians_injured <chr> "0", "0", "0", "0", "0", "0", "2...
## $ number_of_pedestrians_killed <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_persons_injured <chr> "2", "3", "0", "0", "0", "1", "2...
## $ number_of_persons_killed <chr> "0", "0", "0", "0", "0", "0", "0...
## $ off_street_name <chr> "113 AVENUE ...
## $ on_street_name <chr> "SUTPHIN BOULEVARD ...
## $ time <chr> "11:30", "11:30", "11:50", "11:5...
## $ unique_key <chr> "268825", "2929446", "52928", "2...
## $ vehicle_type_code1 <chr> "SPORT UTILITY / STATION WAGON",...
## $ vehicle_type_code2 <chr> "SPORT UTILITY / STATION WAGON",...
## $ zip_code <chr> "11435", NA, "10023", "11377", N...
## $ contributing_factor_vehicle_3 <chr> NA, "Unspecified", NA, NA, NA, N...
## $ vehicle_type_code_3 <chr> NA, "PASSENGER VEHICLE", NA, NA,...
## $ cross_street_name <chr> NA, NA, NA, NA, "PARKING LOT OF ...
## $ contributing_factor_vehicle_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ contributing_factor_vehicle_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
The vehicle collision data was collected by the New York Police Department and published by NYC OpenData.
Source: https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
The motor vehicle collision dataset includes the date and time, location (as borough, street names, zip code and latitude and longitude coordinates), injuries and fatalities, vehicle number and types, and related factors for almost 1.2 million collisions in New York City during 5 years (from 1st July 2012 until 31st December 2017).
Each observation contains is related to 1 collision.
Aim of this section is to improve quality of data. Columns will be renamed, data-types need to be checked and values of variables must be checked.
data=dataNYC
names(data) <- gsub("number_of_", "", names(data), fixed = TRUE)
names(data) <- gsub("_", ".", names(data), fixed = TRUE)
names(data) <- gsub("code1", "CODE.1", names(data), fixed = TRUE)
names(data) <- gsub("code2", "CODE.2", names(data), fixed = TRUE)
names(data) <- gsub("code3", "CODE.3", names(data), fixed = TRUE)
colnames(data)[c(20, 23, 7, 6)] <-c("UNIQUE.KEY", "ZIP.CODE", "LOCATION.COORDINATES", "LOCATION.TYPE")
names(data) <- toupper(names(data))
colnames(data)
## [1] "BOROUGH" "CONTRIBUTING.FACTOR.VEHICLE.1"
## [3] "CONTRIBUTING.FACTOR.VEHICLE.2" "DATE"
## [5] "LATITUDE" "LOCATION.TYPE"
## [7] "LOCATION.COORDINATES" "LONGITUDE"
## [9] "CYCLIST.INJURED" "CYCLIST.KILLED"
## [11] "MOTORIST.INJURED" "MOTORIST.KILLED"
## [13] "PEDESTRIANS.INJURED" "PEDESTRIANS.KILLED"
## [15] "PERSONS.INJURED" "PERSONS.KILLED"
## [17] "OFF.STREET.NAME" "ON.STREET.NAME"
## [19] "TIME" "UNIQUE.KEY"
## [21] "VEHICLE.TYPE.CODE.1" "VEHICLE.TYPE.CODE.2"
## [23] "ZIP.CODE" "CONTRIBUTING.FACTOR.VEHICLE.3"
## [25] "VEHICLE.TYPE.CODE.3" "CROSS.STREET.NAME"
## [27] "CONTRIBUTING.FACTOR.VEHICLE.4" "VEHICLE.TYPE.CODE.4"
## [29] "CONTRIBUTING.FACTOR.VEHICLE.5" "VEHICLE.TYPE.CODE.5"
4 variables will be dropped. Dataframe will have less columns, what will accelerate calculations on computer.
length(unique(data$UNIQUE.KEY)) == nrow(data)
## [1] TRUE
data<-subset(data, select=-c(UNIQUE.KEY, ZIP.CODE, LOCATION.COORDINATES, LOCATION.TYPE))
colnames(data)
## [1] "BOROUGH" "CONTRIBUTING.FACTOR.VEHICLE.1"
## [3] "CONTRIBUTING.FACTOR.VEHICLE.2" "DATE"
## [5] "LATITUDE" "LONGITUDE"
## [7] "CYCLIST.INJURED" "CYCLIST.KILLED"
## [9] "MOTORIST.INJURED" "MOTORIST.KILLED"
## [11] "PEDESTRIANS.INJURED" "PEDESTRIANS.KILLED"
## [13] "PERSONS.INJURED" "PERSONS.KILLED"
## [15] "OFF.STREET.NAME" "ON.STREET.NAME"
## [17] "TIME" "VEHICLE.TYPE.CODE.1"
## [19] "VEHICLE.TYPE.CODE.2" "CONTRIBUTING.FACTOR.VEHICLE.3"
## [21] "VEHICLE.TYPE.CODE.3" "CROSS.STREET.NAME"
## [23] "CONTRIBUTING.FACTOR.VEHICLE.4" "VEHICLE.TYPE.CODE.4"
## [25] "CONTRIBUTING.FACTOR.VEHICLE.5" "VEHICLE.TYPE.CODE.5"
Conversion of variables into correct data types.
data$BOROUGH<-as.factor(data$BOROUGH)
data$TIME<-as.POSIXct(data$TIME,format="%H:%M")
data$TIME<-format(data$TIME, format = "%H:%M")
data$LATITUDE<-as.numeric(data$LATITUDE)
data$LONGITUDE<-as.numeric(data$LONGITUDE)
data$YEAR<-year(data$DATE)
data$MONTH<-month(data$DATE)
data$WEEKDAY<-weekdays(data$DATE)
data$HOUR<-format(as.POSIXct(data$TIME,format="%H:%M"),"%H")
col_names_vect<-colnames(data)
factor_vehicle<-col_names_vect[grepl("FACTOR", col_names_vect)]
vehicle_type<-col_names_vect[grepl("TYPE", col_names_vect)]
street<-vehicles<-col_names_vect[grepl("STREET", col_names_vect)]
injured<-vehicles<-col_names_vect[grepl("INJURED", col_names_vect)]
killed<-vehicles<-col_names_vect[grepl("KILLED", col_names_vect)]
#create vectors with factor and numeric column names
factor_col<-c(factor_vehicle, vehicle_type, street)
numeric_col<-c(injured,killed)
#change type
data[factor_col] <- lapply(data[factor_col], as.factor)
data[numeric_col] <- lapply(data[numeric_col], as.integer)
str(data)
## 'data.frame': 1185165 obs. of 30 variables:
## $ BOROUGH : Factor w/ 5 levels "BRONX","BROOKLYN",..: 4 NA 3 4 NA 4 4 3 2 2 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.1: Factor w/ 49 levels "Accelerator Defective",..: 15 9 47 47 47 9 18 9 14 14 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.2: Factor w/ 48 levels "Accelerator Defective",..: 46 46 46 46 46 46 NA 9 NA 46 ...
## $ DATE : POSIXct, format: "2012-10-21" "2012-10-21" ...
## $ LATITUDE : num 40.7 40.8 40.8 40.8 NA ...
## $ LONGITUDE : num -73.8 -73.9 -74 -73.9 NA ...
## $ CYCLIST.INJURED : int 0 0 0 0 0 1 0 0 0 0 ...
## $ CYCLIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MOTORIST.INJURED : int 2 3 0 0 0 0 0 0 0 0 ...
## $ MOTORIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PEDESTRIANS.INJURED : int 0 0 0 0 0 0 2 0 1 0 ...
## $ PEDESTRIANS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PERSONS.INJURED : int 2 3 0 0 0 1 2 0 1 0 ...
## $ PERSONS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ OFF.STREET.NAME : Factor w/ 16165 levels " ",..: 152 NA 15537 11037 NA 8226 1742 4415 5986 291 ...
## $ ON.STREET.NAME : Factor w/ 10180 levels "\177estfarms road ",..: 8727 NA 2464 7017 NA 2099 5176 9395 1737 1276 ...
## $ TIME : chr "11:30" "11:30" "11:50" "11:50" ...
## $ VEHICLE.TYPE.CODE.1 : Factor w/ 180 levels "315 e","3D","4DSD",..: 132 96 96 96 96 18 132 132 132 96 ...
## $ VEHICLE.TYPE.CODE.2 : Factor w/ 199 levels "15 PA","1S","3D",..: 148 113 113 113 113 113 NA 148 NA 113 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.3: Factor w/ 43 levels "Accelerator Defective",..: NA 42 NA NA NA NA NA NA NA NA ...
## $ VEHICLE.TYPE.CODE.3 : Factor w/ 48 levels "3D","4DSD","AM",..: NA 25 NA NA NA NA NA NA NA NA ...
## $ CROSS.STREET.NAME : Factor w/ 90293 levels " ",..: NA NA NA NA 87938 NA NA NA NA NA ...
## $ CONTRIBUTING.FACTOR.VEHICLE.4: Factor w/ 42 levels "Accelerator Defective",..: NA NA NA NA NA NA NA NA NA NA ...
## $ VEHICLE.TYPE.CODE.4 : Factor w/ 35 levels "4DSD","AMBULANCE",..: NA NA NA NA NA NA NA NA NA NA ...
## $ CONTRIBUTING.FACTOR.VEHICLE.5: Factor w/ 32 levels "Aggressive Driving/Road Rage",..: NA NA NA NA NA NA NA NA NA NA ...
## $ VEHICLE.TYPE.CODE.5 : Factor w/ 23 levels "AMBULANCE","BICYCLE",..: NA NA NA NA NA NA NA NA NA NA ...
## $ YEAR : num 2012 2012 2012 2012 2012 ...
## $ MONTH : num 10 10 10 10 10 10 10 10 10 10 ...
## $ WEEKDAY : chr "Sunday" "Sunday" "Sunday" "Sunday" ...
## $ HOUR : chr "11" "11" "11" "11" ...
Let’s check if in our dataset has unusual values.
summary(data)
## BOROUGH CONTRIBUTING.FACTOR.VEHICLE.1
## BRONX :113118 Unspecified :572508
## BROOKLYN :261275 Driver Inattention/Distraction:173802
## MANHATTAN :214354 Failure to Yield Right-of-Way : 56449
## QUEENS :221896 Fatigued/Drowsy : 48482
## STATEN ISLAND: 39090 Backing Unsafely : 37359
## NA's :335432 (Other) :291185
## NA's : 5380
## CONTRIBUTING.FACTOR.VEHICLE.2
## Unspecified :869962
## Driver Inattention/Distraction: 48076
## Other Vehicular : 20532
## Fatigued/Drowsy : 13027
## Failure to Yield Right-of-Way : 10676
## (Other) : 62735
## NA's :160157
## DATE LATITUDE LONGITUDE
## Min. :2012-07-01 00:00:00 Min. : 0.00 Min. :-201.36
## 1st Qu.:2013-12-17 00:00:00 1st Qu.:40.67 1st Qu.: -73.98
## Median :2015-05-26 00:00:00 Median :40.72 Median : -73.93
## Mean :2015-05-03 11:08:16 Mean :40.72 Mean : -73.92
## 3rd Qu.:2016-09-17 00:00:00 3rd Qu.:40.77 3rd Qu.: -73.87
## Max. :2017-12-31 00:00:00 Max. :41.13 Max. : 0.00
## NA's :213471 NA's :213471
## CYCLIST.INJURED CYCLIST.KILLED MOTORIST.INJURED MOTORIST.KILLED
## Min. :0.00000 Min. :0.00e+00 Min. : 0.0000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00e+00 1st Qu.: 0.0000 1st Qu.:0.000000
## Median :0.00000 Median :0.00e+00 Median : 0.0000 Median :0.000000
## Mean :0.02054 Mean :8.44e-05 Mean : 0.1851 Mean :0.000457
## 3rd Qu.:0.00000 3rd Qu.:0.00e+00 3rd Qu.: 0.0000 3rd Qu.:0.000000
## Max. :4.00000 Max. :2.00e+00 Max. :43.0000 Max. :5.000000
##
## PEDESTRIANS.INJURED PEDESTRIANS.KILLED PERSONS.INJURED
## Min. : 0.00000 Min. :0.000000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.:0.000000 1st Qu.: 0.0000
## Median : 0.00000 Median :0.000000 Median : 0.0000
## Mean : 0.05195 Mean :0.000663 Mean : 0.2566
## 3rd Qu.: 0.00000 3rd Qu.:0.000000 3rd Qu.: 0.0000
## Max. :27.00000 Max. :6.000000 Max. :43.0000
##
## PERSONS.KILLED OFF.STREET.NAME
## Min. :0.000000 : 33212
## 1st Qu.:0.000000 3 AVENUE : 10493
## Median :0.000000 BROADWAY : 10296
## Mean :0.001206 2 AVENUE : 8952
## 3rd Qu.:0.000000 5 AVENUE : 7444
## Max. :8.000000 (Other) :827699
## NA's :287069
## ON.STREET.NAME TIME
## : 25765 Length:1185165
## BROADWAY : 12390 Class :character
## ATLANTIC AVENUE : 10877 Mode :character
## NORTHERN BOULEVARD : 8622
## 3 AVENUE : 8434
## (Other) :888594
## NA's :230483
## VEHICLE.TYPE.CODE.1
## PASSENGER VEHICLE :679753
## SPORT UTILITY / STATION WAGON:286044
## TAXI : 46364
## VAN : 26470
## OTHER : 23972
## (Other) :113667
## NA's : 8895
## VEHICLE.TYPE.CODE.2
## PASSENGER VEHICLE :511567
## SPORT UTILITY / STATION WAGON:217064
## UNKNOWN : 81453
## TAXI : 38724
## BICYCLE : 25874
## (Other) :131502
## NA's :178981
## CONTRIBUTING.FACTOR.VEHICLE.3
## Unspecified : 70611
## Other Vehicular : 1410
## Driver Inattention/Distraction: 1268
## Fatigued/Drowsy : 1122
## Following Too Closely : 416
## (Other) : 2174
## NA's :1108164
## VEHICLE.TYPE.CODE.3
## PASSENGER VEHICLE : 61597
## SPORT UTILITY / STATION WAGON: 31371
## UNKNOWN : 3285
## TAXI : 3087
## PICK-UP TRUCK : 2170
## (Other) : 5526
## NA's :1078129
## CROSS.STREET.NAME
## : 64094
## PARKING LOT 110-00 ROCKAWAY BOULEVARD : 150
## 772 EDGEWATER ROAD : 131
## 110-00 ROCKAWAY BOULEVARD : 95
## PARKING LOT-772 EDGEWATER RD : 91
## (Other) : 118341
## NA's :1002263
## CONTRIBUTING.FACTOR.VEHICLE.4
## Unspecified : 15322
## Other Vehicular : 253
## Fatigued/Drowsy : 222
## Driver Inattention/Distraction: 204
## Following Too Closely : 93
## (Other) : 490
## NA's :1168581
## VEHICLE.TYPE.CODE.4
## PASSENGER VEHICLE : 24309
## SPORT UTILITY / STATION WAGON: 14002
## TAXI : 1666
## PICK-UP TRUCK : 1267
## BICYCLE : 1059
## (Other) : 1670
## NA's :1141192
## CONTRIBUTING.FACTOR.VEHICLE.5
## Unspecified : 3805
## Other Vehicular : 59
## Fatigued/Drowsy : 48
## Driver Inattention/Distraction: 39
## Pavement Slippery : 24
## (Other) : 114
## NA's :1181076
## VEHICLE.TYPE.CODE.5 YEAR MONTH
## PASSENGER VEHICLE : 5358 Min. :2012 Min. : 1.000
## SPORT UTILITY / STATION WAGON: 3034 1st Qu.:2013 1st Qu.: 4.000
## TAXI : 240 Median :2015 Median : 7.000
## PICK-UP TRUCK : 206 Mean :2015 Mean : 6.872
## UNKNOWN : 94 3rd Qu.:2016 3rd Qu.:10.000
## (Other) : 254 Max. :2017 Max. :12.000
## NA's :1175979
## WEEKDAY HOUR
## Length:1185165 Length:1185165
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
From this summarision, we can notice that we have plenty of missing values, however these will be analyzed later. LONGITUDE variable has several values -200, which are our of the range - Longitude for NYC is around -74.
As we can see on above summary - there are plenty of missing values in this dataset. Visualization below represents amount of missing values in every column.
missing_plot<-data %>% summarize_all(funs(sum(is.na(.))/length(.))) %>% gather %>% ggplot(aes(x = reorder(key, value), y = value)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + xlab("Variables") + ylab("Missing values percentage")
ggplotly(missing_plot, tooltip=c("y"))
We can notice that over 90% of data is missing for following columns:
VEHICLE.TYPE.CODE.3
VEHICLE.TYPE.CODE.4
VEHICLE.TYPE.CODE.5
CONTRIBUTING.FACTOR.VEHICLE.3
CONTRIBUTING.FACTOR.VEHICLE.4
CONTRIBUTING.FACTOR.VEHICLE.5
This shows that less than 10% of collisions in NYC contain more than 2 vehicles. These 6 columns will be dropped. Moreover, We can notice that columns such as LATITUDE and LONGITUDE have the same amount of missing value with other columns.
missing_data<-data[is.na(data$LATITUDE),]
print(paste0("There are ",nrow(missing_data)," missing observations"))
## [1] "There are 213471 missing observations"
colMeans(is.na(missing_data))
## BOROUGH CONTRIBUTING.FACTOR.VEHICLE.1
## 1 1
## CONTRIBUTING.FACTOR.VEHICLE.2 DATE
## 1 1
## LATITUDE LONGITUDE
## 1 1
## CYCLIST.INJURED CYCLIST.KILLED
## 1 1
## MOTORIST.INJURED MOTORIST.KILLED
## 1 1
## PEDESTRIANS.INJURED PEDESTRIANS.KILLED
## 1 1
## PERSONS.INJURED PERSONS.KILLED
## 1 1
## OFF.STREET.NAME ON.STREET.NAME
## 1 1
## TIME VEHICLE.TYPE.CODE.1
## 1 1
## VEHICLE.TYPE.CODE.2 CONTRIBUTING.FACTOR.VEHICLE.3
## 1 1
## VEHICLE.TYPE.CODE.3 CROSS.STREET.NAME
## 1 1
## CONTRIBUTING.FACTOR.VEHICLE.4 VEHICLE.TYPE.CODE.4
## 1 1
## CONTRIBUTING.FACTOR.VEHICLE.5 VEHICLE.TYPE.CODE.5
## 1 1
## YEAR MONTH
## 1 1
## WEEKDAY HOUR
## 1 1
data<-data[!is.na(data$LATITUDE),]
missing_plot2<-data %>% summarize_all(funs(sum(is.na(.))/length(.))) %>% gather %>% ggplot(aes(x = reorder(key, value), y = value)) + geom_bar(stat = "identity", fill = "green") + coord_flip() + xlab("Variables") + ylab("Missing values percentage")
missing_plot2
col_names_vect<-colnames(data)
multicars1<-col_names_vect[grepl("VEHICLE.3", col_names_vect)]
multicars2<-col_names_vect[grepl("VEHICLE.4", col_names_vect)]
multicars3<-col_names_vect[grepl("VEHICLE.5", col_names_vect)]
multicars4<-col_names_vect[grepl("CODE.3", col_names_vect)]
multicars5<-col_names_vect[grepl("CODE.4", col_names_vect)]
multicars6<-col_names_vect[grepl("CODE.5", col_names_vect)]
latitude<-col_names_vect[grepl("LATITUDE", col_names_vect)]
longitude<-col_names_vect[grepl("LONGITUDE", col_names_vect)]
data_copy=data
multi_collision_with_location<-c(multicars1, multicars2, multicars3, multicars4, multicars5, multicars6, latitude, longitude)
multi_collision_without_location<-multi_collision_with_location [! multi_collision_with_location %in% c(latitude, longitude)]
data_multi_collision_with_location<-data[multi_collision_with_location]
data <- data[, !colnames(data) %in% multi_collision_without_location]
yearDistribution<-ggplot(data, aes(YEAR)) +
geom_bar(aes(fill = BOROUGH), position = "dodge", stat="count") +scale_fill_brewer(palette = "Set2") + xlab("Year") + ylab("Amount of collisions") + ggtitle("Number of Collisions Per Year for each Borough")
yearDistribution #+ geom_text(stat='count', aes(label=..count..), vjust=-1)
#ggplotly(yearDistribution, tooltip = c("y"))
2012 contains less collisions, because we retrieved data from 1st of July.
monthDistribution<-ggplot(data,aes(MONTH)) + geom_bar(colour='red', fill='blue', stat = "count") + ggtitle('Number of accidents in dataset for each Borough') + scale_x_discrete(limits = month.abb) + xlab("Month") + ylab("Amount of collisions") + scale_fill_brewer(palette = "Set1") +facet_wrap(~BOROUGH) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(monthDistribution, tooltip = c("y"))
df_heatmap<-data[,c("YEAR", "MONTH", "TIME")]
df_heatmap <- aggregate(TIME ~ YEAR + MONTH, data = df_heatmap, FUN = length)
names(df_heatmap) <- c("YEAR", "MONTH", "COUNT")
df_heatmap$MONTH <- as.factor(month.abb[df_heatmap$MONTH])
hm<-create_heatmap(df_heatmap)
hm + geom_text(aes(label=COUNT))
ggplotly(hm, tooltip = c("COUNT"))
df_heatmap_kill<-data[,c("YEAR", "MONTH", "PERSONS.KILLED")]
PERSONS.KILLED<-data$PERSONS.KILLED
df_heatmap_kill<-aggregate_heatmap(df_heatmap_kill, PERSONS.KILLED)
hm_kill<-create_heatmap(df_heatmap_kill)
hm_kill + geom_text(aes(label=COUNT))
#print(paste0("There have been ",sum(PERSONS.KILLED)," death accidents since 01/07/2012 until 31/12/2017")))
df_heatmap_injured <- data[,c("YEAR", "MONTH", "PERSONS.INJURED")]
PERSONS.INJURED <- df_heatmap_injured$PERSONS.INJURED
df_heatmap_injured <- aggregate_heatmap(df_heatmap_injured, PERSONS.INJURED)
hm_injured <- create_heatmap(df_heatmap_injured)
hm_injured + geom_text(aes(label=COUNT))
#print(paste0("There have been ",sum(data$PERSONS.INJURED)," death accidents since 01/07/2012 until 31/12/2017")))
weekdayPlot<-ggplot(data,aes(WEEKDAY)) + geom_bar(colour="black", fill="purple") + ggtitle("Accidents by weekday") +ylab('Accidents') + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_x_discrete(labels = paste0(weekdays(Sys.Date()+4:10)))
weekdayPlot
#ggplotly(weekdayPlot)
weekdayPlot + facet_wrap(~BOROUGH)
plotHour<-ggplot(data) + geom_line(aes(HOUR),group = 1,stat = "count", color = "steelblue") + xlab("Hour")
plotHour
plotHour + facet_wrap(~BOROUGH)
plotHour + facet_wrap(~YEAR)
plotHour + facet_wrap(~WEEKDAY)
library('ggmap')
map <- get_map("new york", zoom = 10)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=new+york&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=new%20york&sensor=false
ggmap(map) + geom_point(data = data, aes(x = LONGITUDE, y = LATITUDE, color=BOROUGH))
df_injured<-data[,c("LATITUDE","LONGITUDE","PEDESTRIANS.INJURED","CYCLIST.INJURED","MOTORIST.INJURED")] %>% gather(TYPE,VALUE,3:5) %>% na.omit() %>% group_by(LATITUDE,LONGITUDE,TYPE) %>% summarise(TOTAL=sum(VALUE,na.rm=T)) %>% filter(TOTAL!=0)
type<-"PEDESTRIANS.INJURED"
pedestrianInjured <-geo_plot(map,df_injured,type)
pedestrianInjured
filteredData<-df_injured %>% filter(TOTAL >10)
type<-"PEDESTRIANS.INJURED"
pedestrianInjured1 <-geo_plot_2(map,filteredData,type)
pedestrianInjured1
type<-"CYCLIST.INJURED"
cyclistInjured <-geo_plot(map,df_injured,type)
cyclistInjured
filteredData<-df_injured %>% filter(TOTAL > 15)
type<-"CYCLIST.INJURED"
cyclistInjured1 <-geo_plot_2(map,filteredData,type)
cyclistInjured1
type<-"MOTORIST.INJURED"
motoristInjured <-geo_plot(map,df_injured,type)
motoristInjured
## Warning: Removed 1 rows containing missing values (geom_point).
filteredData<-df_injured %>% filter(TOTAL > 120)
type<-"MOTORIST.INJURED"
motoristInjured1 <-geo_plot_2(map,filteredData,type)
motoristInjured1
df_killed <- data[,c("LATITUDE","LONGITUDE","PEDESTRIANS.KILLED","CYCLIST.KILLED","MOTORIST.KILLED")] %>% gather(TYPE,VALUE,3:5) %>% na.omit() %>% group_by(LATITUDE,LONGITUDE,TYPE) %>% summarise(TOTAL=sum(VALUE,na.rm=T)) %>% filter(TOTAL!=0)
type<-"PEDESTRIANS.KILLED"
pedestrianKilled <-geo_plot(map,df_killed,type)
pedestrianKilled
filteredData<-df_killed %>% filter(TOTAL > 3)
type<-"PEDESTRIANS.KILLED"
pedestrianKilled1 <-geo_plot_2(map,filteredData,type)
pedestrianKilled1
type<-"CYCLIST.KILLED"
cyclistKilled <-geo_plot(map,df_killed,type)
cyclistKilled
filteredData<-df_killed %>% filter(TOTAL > 1)
type<-"CYCLIST.KILLED"
cyclistKilled1 <-geo_plot_2(map,filteredData,type)
cyclistKilled1
cyclistKilled1
type<-"MOTORIST.KILLED"
motoristKilled <-geo_plot(map,df_killed,type)
motoristKilled
filteredData<-df_killed %>% filter(TOTAL > 3)
type<-"MOTORIST.KILLED"
motoristKilled <-geo_plot_2(map,filteredData,type)
motoristKilled
factorData <- data_copy %>% select(PERSONS.KILLED,PERSONS.INJURED,CONTRIBUTING.FACTOR.VEHICLE.1,CONTRIBUTING.FACTOR.VEHICLE.2,CONTRIBUTING.FACTOR.VEHICLE.3,CONTRIBUTING.FACTOR.VEHICLE.4,CONTRIBUTING.FACTOR.VEHICLE.5) %>% gather(TYPE,VALUE,1:2) %>% gather(VEHICLE_TYPE,CAUSE,1:5) %>% filter(VALUE!=0,CAUSE!="",CAUSE!="Unspecified")
## Warning: attributes are not identical across measure variables;
## they will be dropped
factorData1 <- factorData %>% select(-VEHICLE_TYPE) %>% group_by(TYPE,CAUSE) %>% summarise(TOTAL=sum(VALUE,na.rm=T))
cause_plot(factorData1)
cause <- "Unsafe Speed"
causeLocation<- create_cause_dataset(data_copy,cause)
df1<-causeLocation[,c("LATITUDE", "LONGITUDE", "BOROUGH")]
ggmap(map) + geom_point(data = df1, aes(x = LONGITUDE, y = LATITUDE, color=BOROUGH), size = 3, shape = 1)
locationDF<-data[,c("LATITUDE","LONGITUDE")]
locationDF<-locationDF[complete.cases(locationDF),]
km<-kmeans(locationDF, 5)
locationDF$CLUSTER<-as.factor(km$cluster)
ggmap(map) + geom_point(data = locationDF, aes(x = LONGITUDE, y = LATITUDE, color=CLUSTER), size = 3, shape = 1)
## Warning: Removed 8 rows containing missing values (geom_point).
num1<-subset(num, select=-c(PERSONS.KILLED, PERSONS.INJURED))
km<-kmeans(num1, 3)
num1$CLUSTER<-as.factor(km$cluster)
ggmap(map) + geom_point(data = num1, aes(x = LONGITUDE, y = LATITUDE, color=CLUSTER), size = 3, shape = 1)
## Warning: Removed 8 rows containing missing values (geom_point).
This comprehensive analysis provided answers for questions below:
1. When collisions happen the most frequently? Is it correlated with specific months or weekdays?
Collisions happen mostly frequently in the middle of the week.
Time: there is a high amount of collisions at 8AM and 4PM. The biggest amount of collisions happen, when people come back from work. However, we can notice that amount of collisions in night increases during weekend.
2. Where collisions appear the most frequently? Which boroughs are the most dangerous?
The biggest amount of accidents we can notice in downtown of Manhattan and the smallest amount - in Staten Island.
3. What are the main factors that contribute collisions in NYC?
Main factors are: Distraction, Failure to Yield Right-of-Way and Traffic Control Disregarded.